Tests of fit for the power function lognormal distribution

In this study, tests of fit for the power function lognormal distribution are considered. The probability plot, the probability plot correlation coefficient, and three goodness-of-fit tests, namely the Kolmogorov–Smirnov (KS), Cramér–von Mises (CvM), and Anderson–Darling (AD) tests, are provided. Tables of critical values are obtained by simulation, and power comparisons show that the AD test outperforms the KS and CvM tests. Finally, to illustrate these test procedures, we fit this distribution to data representing the survival times of 121 breast cancer patients from one hospital.


Introduction
Statistical distributions are important for modeling and predicting real-world situations. For example, the empirical analysis of the distribution of income amounts is a major topic of research in development economics. One of the primary purposes of this analysis is simply to describe the distribution of income and derive descriptive and summative inequality measures, such as the Gini coefficient. A large number of income distributions have been proposed in the statistical literature, including the lognormal, gamma, Pareto, Weibull, Dagum, Singh-Maddala, and generalized beta-2 distributions [1].
Another type of income distribution, the power function lognormal composite (PFLC) distribution, has recently been proposed [2]. This flexible distribution can have positive or negative skewness and can be either leptokurtic or platykurtic, depending on the parameters. Extensive statistical inference methods have also been presented [2], such as those for modeling household income and automobile insurance claims. Moreover, it has been demonstrated that inequality measures, including the Gini coefficient, generalized entropy index, Theil's entropy index, and the Atkinson, Bonferroni, and Zenga indexes, can be obtained using numerical methods based on the PFLC distribution [3].
The modeling and analysis of lifetime data are important aspects of statistical work in areas such as engineering, medicine, and the biological sciences. Lifetime models are often skewed or are far from normal. Therefore, models such as the exponential, Weibull, lognormal, and gamma often occupy a central position because of their demonstrated usefulness in a wide range of situations. Over the years, many models have been developed and applied to lifetime data, including the exponentiated generalized linear exponential distribution [4], generalized transmuted-G family [5], modified beta transmuted exponential distribution [6], extended Gumbel distribution [7], and generalized Marshall-Olkin exponentiated exponential distribution [8]. We will demonstrate that the PFLC distribution can also be applied to the modeling of lifetime data.
Goodness-of-fit (GoF) usually refers to whether a dataset is consistent with sampling from a model for a distribution. Many GoF tests exist, and they can generally be classified as either graphical techniques or statistical methods. Statistical methods are usually preferred because of their objectivity. One frequently used graphical method is the probability plot [9], which generally uses special scales on which the cumulative distribution function (CDF) of a particular distribution plots as a straight line. A normal probability plot is defined as a plot of the ith-order statistic versus some measure of location of the ith-order statistics from a standard normal distribution. The probability plot correlation coefficient, the product moment correlation coefficient that measures the degree of linear association between these two random variables, can be used as an appropriate test statistic [10]. The Kolmogorov-Smirnov (KS), Cramér-von Mises (CvM), and Anderson-Darling (AD) tests are but a few of the traditional statistical tests available to determine GoF. These are all based on the empirical distribution function (EDF). They test the null hypothesis by measuring the distance between the EDF estimated from observed data and the CDF of the fitted models [11].
We introduce and implement the probability plot, probability plot correlation coefficient, and GoF tests for the PFLC distribution. The first two methods are based on a transformation of the cumulative distributions. The classical KS, CvM, and AD tests are considered. A power study was conducted to investigate their performance, and the results show that the AD test outperforms the KS and CvM tests, and that the KS test is the weakest, requiring a much larger sample size to achieve comparable power to the other tests. We also show that the PFLC distribution can provide a good fit for one survival dataset.
The rest of this paper is organized as follows. The setup and statistical properties of the PFLC distribution are discussed in Section 2. The probability plot method and probability plot correlation coefficient are investigated in Sections 3 and 4, respectively. The derivations of computing formulae, critical values, and power comparisons for three PFLC GoF test statistics follow in Section 5. An application to a survival-time dataset is presented in Section 6. Some conclusions are drawn in Section 7.

Power function lognormal composite distribution
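The probability density function (PDF) of the PFLC distribution can be written as [2]

$$f(x) = \begin{cases} \dfrac{w\alpha}{\theta}\left(\dfrac{x}{\theta}\right)^{\alpha-1}, & 0 < x \le \theta, \\[2ex] \dfrac{1-w}{\Phi(\alpha\sigma)}\,\dfrac{1}{x\sigma}\,\phi\!\left(\dfrac{\ln x - \ln\theta - \alpha\sigma^{2}}{\sigma}\right), & x > \theta, \end{cases} \qquad (1)$$

where w is a mixing weight, defined as

$$w = \frac{\phi(\alpha\sigma)}{\alpha\sigma\,\Phi(\alpha\sigma) + \phi(\alpha\sigma)}, \qquad (2)$$

and φ(.) and Φ(.) are the PDF and CDF of the standard normal distribution, respectively. PFLC is a distribution with three unknown parameters (α > 0, θ > 0, and σ > 0), and X~PFLC(α,θ,σ) indicates that X follows this distribution.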
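The corresponding CDF F(x) and the quantile function X(p) are given by

$$F(x) = \begin{cases} w\left(\dfrac{x}{\theta}\right)^{\alpha}, & 0 < x \le \theta, \\[2ex] 1 - \dfrac{1-w}{\Phi(\alpha\sigma)}\left[1 - \Phi\!\left(\dfrac{\ln x - \ln\theta - \alpha\sigma^{2}}{\sigma}\right)\right], & x > \theta, \end{cases} \qquad (3)$$

and

$$X(p) = \begin{cases} \theta\left(\dfrac{p}{w}\right)^{1/\alpha}, & 0 < p \le w, \\[2ex] \theta\exp\!\left\{\alpha\sigma^{2} + \sigma\,\Phi^{-1}\!\left(1 - \dfrac{(1-p)\,\Phi(\alpha\sigma)}{1-w}\right)\right\}, & w < p < 1. \end{cases} \qquad (4)$$

Let X~PFLC(α,θ,σ). Then $Y = w(X/\theta)^{\alpha}$ has the density function

$$f(y) = \begin{cases} 1, & 0 < y \le w, \\[1ex] \dfrac{1-w}{\Phi(\alpha\sigma)}\,\dfrac{1}{\alpha\sigma\,y}\,\phi\!\left(\dfrac{\ln(y/w)}{\alpha\sigma} - \alpha\sigma\right), & y > w. \end{cases} \qquad (5)$$

From (2), one can easily verify that w is a decreasing function of ασ. Table 1 presents the values of ασ for a grid of w values [0.01(0.01)0.99], which are accurate to about six significant digits. Thus f(y) in (5) can also be seen as a function of y when w is given. The CDF F(y) and quantile function Y(p) are given by

$$F(y) = \begin{cases} y, & 0 < y \le w, \\[1ex] 1 - \dfrac{1-w}{\Phi(\alpha\sigma)}\left[1 - \Phi\!\left(\dfrac{\ln(y/w)}{\alpha\sigma} - \alpha\sigma\right)\right], & y > w, \end{cases} \qquad (6)$$

and

$$Y(p) = \begin{cases} p, & 0 < p \le w, \\[1ex] w\exp\!\left\{(\alpha\sigma)^{2} + \alpha\sigma\,\Phi^{-1}\!\left(1 - \dfrac{(1-p)\,\Phi(\alpha\sigma)}{1-w}\right)\right\}, & w < p < 1. \end{cases} \qquad (7)$$

We will refer to distribution (5) as the standard power function lognormal composite distribution, denoted by SPFLC(w).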
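The quantities of this section can be checked numerically with a minimal Python sketch. It assumes that the mixing weight takes the form w = φ(ασ)/(ασΦ(ασ) + φ(ασ)) and that the upper part of the distribution is a lognormal with location lnθ + ασ² truncated below at θ; these forms are assumptions of the sketch, chosen because they reproduce the weights quoted in the text (about 0.057 for PFLC(4,5,0.42) and about 0.5841 for PFLC(5,10,0.08)). The function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: pdf, cdf, inv_cdf


def mixing_weight(alpha: float, sigma: float) -> float:
    """Assumed form of the mixing weight, with k = alpha * sigma."""
    k = alpha * sigma
    phi_k = STD_NORMAL.pdf(k)
    return phi_k / (k * STD_NORMAL.cdf(k) + phi_k)


def pflc_cdf(x: float, alpha: float, theta: float, sigma: float) -> float:
    """Assumed PFLC CDF: power function below theta, truncated lognormal above."""
    if x <= 0.0:
        return 0.0
    w = mixing_weight(alpha, sigma)
    if x <= theta:
        return w * (x / theta) ** alpha
    k = alpha * sigma
    z = math.log(x / theta) / sigma - k
    return 1.0 - (1.0 - w) * (1.0 - STD_NORMAL.cdf(z)) / STD_NORMAL.cdf(k)


def pflc_quantile(p: float, alpha: float, theta: float, sigma: float) -> float:
    """Inverse of pflc_cdf for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    w = mixing_weight(alpha, sigma)
    if p <= w:
        return theta * (p / w) ** (1.0 / alpha)
    k = alpha * sigma
    u = 1.0 - (1.0 - p) * STD_NORMAL.cdf(k) / (1.0 - w)
    return theta * math.exp(alpha * sigma ** 2 + sigma * STD_NORMAL.inv_cdf(u))


if __name__ == "__main__":
    print(round(mixing_weight(4, 0.42), 4))   # weight for PFLC(4, 5, 0.42)
    print(round(mixing_weight(5, 0.08), 4))   # weight for PFLC(5, 10, 0.08)
```

Because the weight depends on α and σ only through the product ασ, the same function can be used to tabulate w against ασ.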

Probability plot
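Taking the logarithm of (4), we can obtain

$$\ln x = \begin{cases} \ln\theta + \dfrac{1}{\alpha}\ln\dfrac{p}{w}, & 0 < p \le w, \\[2ex] \ln\theta + \alpha\sigma^{2} + \sigma\,\Phi^{-1}\!\left(1 - \dfrac{(1-p)\,\Phi(\alpha\sigma)}{1-w}\right), & w < p < 1. \end{cases} \qquad (8)$$

Note that when p = w, lnθ + 1/α ln(p/w) = lnθ + 1/α ln(p/p) = lnθ, and the second branch also reduces to lnθ because $\Phi^{-1}(1 - \Phi(\alpha\sigma)) = -\alpha\sigma$. Eq (8) can be written as

$$q = \ln x - \ln\theta, \qquad (9)$$

where q can be considered to be equal to the following formula:

$$q = \begin{cases} \dfrac{1}{\alpha}\left[\ln p - \ln w\right], & 0 < p \le w, \\[2ex] \alpha\sigma^{2} + \sigma\,\Phi^{-1}\!\left(1 - \dfrac{(1-p)\,\Phi(\alpha\sigma)}{1-w}\right), & w < p < 1. \end{cases} \qquad (10)$$

Thus, (9) represents a linear relationship between q and lnx, with an intercept of −lnθ and a slope of 1.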
The probability plot of 100 sets of simulated data from PFLC(5,10,0.08) with random seed 1234 is shown in Fig 1, where the solid line AB is a probability plot drawn according to (8). The starting point A is the intersection of lnx = ln(x_min) and AB, where ln(x_min) is the natural logarithm of the minimum value of the simulated data. The ending point B is the intersection of lnx = ln(x_max) and AB, where ln(x_max) is the natural logarithm of the maximum value of the simulated data. Point E(lnθ, 0) is the intersection of lnx = lnθ and q = 0, which corresponds exactly to p = w. In this example, ln(x_min) = 1.7780, ln(x_max) = 2.4514, and lnθ = ln 10 = 2.3026. The line q = 0 divides AB into two parts. For the upper part BE, $q = \alpha\sigma^{2} + \sigma\,\Phi^{-1}\!\left(1 - (1-p)\,\Phi(\alpha\sigma)/(1-w)\right)$.
Many authors have discussed methods for choosing the values of p for a given n for use in such plots [9]. We use the Hazen formula,

$$p_i = \frac{i - 0.5}{n}, \quad i = 1, 2, \ldots, n. \qquad (11)$$

We determine whether the PFLC distribution can be used to fit a dataset using a probability plot as follows:
(i) Order the sample values to obtain $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.
(ii) Obtain the estimates $(\hat\alpha, \hat\theta, \hat\sigma)$ of (α, θ, σ) by maximum likelihood, and then $\hat w$ from (2).
(iii) Compute q from (10), where $p_i = (i - 0.5)/n$, $i = 1, 2, \ldots, n$.
(iv) Plot q versus lnx; the PFLC distribution can be used to model the dataset if the plot is approximately a straight line.
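Steps (i) to (iv) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' MATLAB code: the parameter estimates are taken as given (no maximum likelihood fit is performed), the weight is assumed to have the form w = φ(ασ)/(ασΦ(ασ) + φ(ασ)), the upper-branch expression for q is the one implied by that form, and all function names are hypothetical. The sketch returns the plot coordinates (ln x, q) rather than drawing them, so any plotting tool can be used afterwards.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def mixing_weight(alpha: float, sigma: float) -> float:
    """Assumed form of the mixing weight, with k = alpha * sigma."""
    k = alpha * sigma
    phi_k = STD_NORMAL.pdf(k)
    return phi_k / (k * STD_NORMAL.cdf(k) + phi_k)


def hazen_positions(n: int) -> list:
    """Hazen plotting positions p_i = (i - 0.5) / n for i = 1..n."""
    return [(i - 0.5) / n for i in range(1, n + 1)]


def q_value(p: float, alpha: float, sigma: float) -> float:
    """Vertical plot coordinate q for probability p (two branches, split at p = w)."""
    w = mixing_weight(alpha, sigma)
    if p <= w:
        return (math.log(p) - math.log(w)) / alpha
    k = alpha * sigma
    u = 1.0 - (1.0 - p) * STD_NORMAL.cdf(k) / (1.0 - w)
    return alpha * sigma ** 2 + sigma * STD_NORMAL.inv_cdf(u)


def probability_plot_points(sample, alpha: float, sigma: float) -> list:
    """Return (ln x_(i), q_i) pairs; alpha and sigma are estimates supplied by the caller."""
    xs = sorted(sample)                      # step (i): order the sample
    ps = hazen_positions(len(xs))            # step (iii): plotting positions
    return [(math.log(x), q_value(p, alpha, sigma)) for x, p in zip(xs, ps)]
```

If the data follow the fitted PFLC distribution, the returned points should lie close to the line q = ln x − ln θ, which is what step (iv) inspects visually.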

The probability plot correlation coefficient
The probability plot correlation (PPC) coefficient was used as a test statistic for normality [10]. The PPC test measures the linearity of a probability plot; if the probability plot is expected to be almost linear, then the correlation coefficient will be close to one. We now derive the PPC test for the PFLC distribution.
Taking the logarithm of (7), we obtain

$$\ln Y(p) = \begin{cases} \ln p, & 0 < p \le w, \\[1ex] \ln w + (\alpha\sigma)^{2} + \alpha\sigma\,\Phi^{-1}\!\left(1 - \dfrac{(1-p)\,\Phi(\alpha\sigma)}{1-w}\right), & w < p < 1. \end{cases} \qquad (12)$$

Letting $Z_i = \ln Y_i$, then $T_{1i}$, $T_{2i}$ are defined as

$$T_{1i} = \ln p_i, \qquad T_{2i} = \ln w + (\alpha\sigma)^{2} + \alpha\sigma\,\Phi^{-1}\!\left(1 - \dfrac{(1-p_i)\,\Phi(\alpha\sigma)}{1-w}\right), \qquad (13)$$

where $T_{1i}$ applies to the lower part (the m observations with $y_i$ less than w) and $T_{2i}$ to the upper part. Then, the correlation coefficient $r_Q$ between $Z_i$ and $T_i$ can be defined as

$$r_Q = \frac{\sum_{i=1}^{n}(Z_i - \bar Z)(T_i - \bar T)}{\sqrt{\sum_{i=1}^{n}(Z_i - \bar Z)^{2}\,\sum_{i=1}^{n}(T_i - \bar T)^{2}}}. \qquad (14)$$

Because $r_Q$ only depends on w, we can obtain the critical values of this statistic for a given w in practice. Monte Carlo studies have determined the percentage points of the statistics for sample sizes n = 5(5)90, 90(10)100, 100(50)500, and 500(100)1000. For each case, the procedure was repeated 20 000 times to produce an empirical distribution of the test statistic, from which sample quantiles approximating the critical values were obtained. For w = 0.1, the algorithm to obtain the 5% critical values is as follows.
(i) Set w = 0.1, generate n random numbers from (7), and order the sample values to obtain $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$.
(ii) Compute m, which is the number of $y_i$ values less than w, and compute percentile point $p_i$ from (11).
(iii) Compute $Z_i = \ln y_i$ and $T_i$ from (13).
(iv) Compute the correlation coefficient $r_Q$ between $Z_i$ and $T_i$ from (14).
(v) Repeat steps (i)-(iv) 20 000 times to obtain the 5% sample quantiles of $r_Q$ as the 5% critical values.
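The simulation loop above can be sketched in Python as follows. Several points are assumptions of this sketch rather than details confirmed by the text: the weight is taken to be w = φ(k)/(kΦ(k) + φ(k)) with k = ασ, k is recovered from w by bisection instead of being read from Table 1, and the theoretical values T_i switch branch according to whether p_i exceeds w (not according to the count m), which keeps the argument of the inverse normal CDF inside (0, 1). The resulting critical values are therefore not expected to reproduce Table 2 exactly. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def weight_from_k(k: float) -> float:
    """Assumed mixing weight as a function of k = alpha * sigma."""
    phi_k = STD_NORMAL.pdf(k)
    return phi_k / (k * STD_NORMAL.cdf(k) + phi_k)


def k_from_weight(w: float, lo: float = 1e-9, hi: float = 40.0) -> float:
    """Invert weight_from_k by bisection (the weight decreases as k grows)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if weight_from_k(mid) > w:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def log_quantile(p: float, w: float, k: float) -> float:
    """Logarithm of the SPFLC(w) quantile function at probability p."""
    if p <= w:
        return math.log(p)
    u = 1.0 - (1.0 - p) * STD_NORMAL.cdf(k) / (1.0 - w)
    return math.log(w) + k * k + k * STD_NORMAL.inv_cdf(u)


def pearson(a, b) -> float:
    """Product moment correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def open_uniform(rng: random.Random) -> float:
    """Uniform draw on the open interval (0, 1)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def critical_value(w: float, n: int, reps: int = 20000,
                   level: float = 0.05, seed: int = 1234) -> float:
    """Lower-tail Monte Carlo critical value of r_Q for SPFLC(w) samples of size n."""
    rng = random.Random(seed)
    k = k_from_weight(w)
    t = [log_quantile((i - 0.5) / n, w, k) for i in range(1, n + 1)]  # Hazen positions
    r_values = []
    for _ in range(reps):
        z = sorted(log_quantile(open_uniform(rng), w, k) for _ in range(n))
        r_values.append(pearson(z, t))
    r_values.sort()
    return r_values[int(level * reps)]       # empirical 5% quantile when level = 0.05
```

A call such as `critical_value(0.1, 10)` gives an approximate 5% point for n = 10; small values of `reps` make the estimate noisy.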
Table 2 presents the 5% critical values of the distribution of $r_Q$ for selected sample sizes when w equals 0.1(0.2)0.9. For example, the critical value of $r_Q$ for n = 10 is 0.910701 when w is 0.1; this means that in 5% of random samples of size 10 drawn under the null hypothesis, the correlation coefficient will be at most 0.910701.
We can determine whether the PFLC distribution can be used to fit a dataset using the correlation coefficient at the 5% significance level as follows:
(i) Order the sample values to obtain $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.
(ii) Obtain the estimates $(\hat\alpha, \hat\theta, \hat\sigma)$ of (α, θ, σ) by maximum likelihood, and calculate $\hat w$ from (2).
(iii) Calculate $Z_i = \ln y_i$, where $y_i = \hat w\,(x_{(i)}/\hat\theta)^{\hat\alpha}$.
(iv) Calculate $r_Q$ from (14).
(v) Reject $H_0$ (the sample is from a PFLC distribution) at the 5% significance level if $r_Q$ is less than the 5% critical value.

Goodness-of-fit tests
We discuss three GoF tests for the PFLC distribution [11]. As described in Section 2, the test for X~PFLC(α,θ,σ) is equivalent to that for Y~SPFLC(w). Therefore, we test whether the underlying probability distribution is SPFLC(w) for a given random sample $Y_1, Y_2, \ldots, Y_n$.

Basic GoF test statistics
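Let $y_1, y_2, \ldots, y_n$ denote the data in ascending order. The test statistic for the KS test is thus

$$D = \max(D^{+}, D^{-}), \qquad (15)$$

where

$$D^{+} = \max_{1 \le i \le n}\left[\frac{i}{n} - \hat F(y_i)\right], \qquad D^{-} = \max_{1 \le i \le n}\left[\hat F(y_i) - \frac{i-1}{n}\right].$$

The CvM test statistic is

$$W^{2} = \sum_{i=1}^{n}\left[\hat F(y_i) - \frac{2i-1}{2n}\right]^{2} + \frac{1}{12n}, \qquad (16)$$

and the AD test statistic is

$$A^{2} = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln \hat F(y_i) + \ln\!\left(1 - \hat F(y_{n+1-i})\right)\right]. \qquad (17)$$

Note that the estimate $\hat F$ of F is obtained by substituting the estimated parameter $\hat w$ for w in (6).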
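The three statistics depend on the data only through the fitted CDF values, so they can be computed by one small routine. The following Python sketch uses the standard EDF computing formulas for the KS, CvM, and AD statistics; the function name `edf_statistics` is hypothetical, and the fitted values are assumed to lie strictly between 0 and 1 so that the logarithms in the AD statistic are defined.

```python
import math


def edf_statistics(u):
    """Return (D, W2, A2) from fitted CDF values u_i = F_hat(y_i), given in any order."""
    u = sorted(u)
    n = len(u)
    # Kolmogorov-Smirnov: largest gap between the EDF steps and the fitted CDF.
    d_plus = max((i + 1) / n - ui for i, ui in enumerate(u))
    d_minus = max(ui - i / n for i, ui in enumerate(u))
    d = max(d_plus, d_minus)
    # Cramer-von Mises: squared distances from the midpoints (2i - 1) / (2n).
    w2 = sum((ui - (2 * i + 1) / (2 * n)) ** 2 for i, ui in enumerate(u)) + 1.0 / (12 * n)
    # Anderson-Darling: log-weighted sum that emphasizes both tails.
    a2 = -n - sum(
        (2 * i + 1) * (math.log(u[i]) + math.log(1.0 - u[n - 1 - i]))
        for i in range(n)
    ) / n
    return d, w2, a2
```

For a single fitted value of 0.5 the routine returns D = 0.5, W² = 1/12, and A² = 2 ln 2 − 1, which can be verified by hand from the formulas.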

Critical values
When all three parameters are unknown, the problem reduces to testing whether the y values have distribution (5). Because the distribution of y is only related to w, we can obtain the critical values of the three GoF tests under a given w in practice [11]. Critical values for the GoF test statistics are obtained in the same way as those for the PPC test. For w = 0.057, the algorithm to obtain the 5% critical values for the GoF test statistics is as follows:
(i) Set w = 0.057, generate n random numbers from (7), and order the sample values to obtain $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$.
(ii) Obtain the estimate $\hat w$ of w by maximum likelihood.
(iii) Calculate the GoF test statistics $D$, $W^{2}$, and $A^{2}$ using (15)-(17), respectively.
(iv) Repeat steps (i)-(iii) 20 000 times. Then, obtain the upper 5% sample quantiles (that is, the 95th percentiles) of $D$, $W^{2}$, and $A^{2}$ as the 5% critical values.
Table 3 presents the critical values of the distribution of D for selected sample sizes at five significance levels when w equals 0.057. For example, at a 5% significance level, the critical value of D for n = 10 is 0.408884; this means that in 5% of random samples of size 10, the maximum absolute deviation between the sample and population cumulative distributions will be at least 0.408884. Tables 4 and 5 present the respective critical values of $W^{2}$ and $A^{2}$.
For the GoF tests, the null hypothesis is that the sample comes from a PFLC distribution, and the alternative hypothesis is that it does not. We can determine whether the PFLC distribution can be used to fit a dataset using the GoF tests at the 5% significance level as follows:
(i) Order the sample values to obtain $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.
(ii) Obtain the estimates $(\hat\alpha, \hat\theta, \hat\sigma)$ of (α, θ, σ) by maximum likelihood, and calculate $\hat w$ from (2).
(iii) Calculate $y_i = \hat w\,(x_{(i)}/\hat\theta)^{\hat\alpha}$ and $\hat F(y_i)$ from (6).
(iv) Calculate the GoF test statistics $D$, $W^{2}$, and $A^{2}$ using (15)-(17), respectively.
(v) Reject $H_0$ (the sample is from a PFLC distribution) at the 5% significance level if the statistic exceeds the 5% critical value.

Power comparison
The power of a test is the probability that it will reject the null hypothesis if the alternative hypothesis is true (hence, power is the complement of the probability of a Type II error). Therefore, the power depends on the alternative distribution. A Monte Carlo study was performed to evaluate the power of the three GoF tests using 100 000 samples of different sizes from four alternative distributions. For these tests, the null hypothesis was that the generated observations were drawn from a PFLC(4,5,0.42) distribution. Note that the mixing weight w is 0.057 for PFLC(4,5,0.42), and the 5% critical values are from Tables 3-5. Simulations were carried out in MATLAB, and all code used can be found in the supporting information. Four alternative distributions were considered: Gamma(3.5, 2.7), χ²(10), LN(2.3, 0.5), and Weibull(10, 2), whose PDFs are as follows:
The plots of the five distributions are shown in Fig 3, from which it can be seen that PFLC(4,5,0.42) has the largest mode. To the left of the mode of PFLC(4,5,0.42), the distributions differ slightly, while to the right of it they are difficult to distinguish from their PDFs. Thus, these five distributions exhibit similar overall shapes. We note that if the shapes of these five distributions differ significantly, then the results will be highly unreliable.
Table 6 summarizes the simulated powers of the PPC test for the four selected distributions at the 5% significance level; for each alternative distribution, the power increases with the sample size.
Table 7 summarizes the simulated power for four selected distributions at the 5% significance level.From Table 7, the following can be seen: (i) The powers of the tests increase with the sample sizes.
(ii) The AD test outperforms the KS and CvM tests across different sizes and alternative distributions.
(iii) The KS test is the weakest test, and it requires a much larger sample to achieve comparable power to the other two tests.
(iv) The powers of the tests are the smallest when the alternative distribution is LN(2.3,0.5), which is the most similar to PFLC(4,5,0.42), as can be seen from Fig 3.
Hence, we recommend the AD test, followed by the CvM and KS tests.
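The logic of such a power study can be illustrated with a simplified Python sketch. It differs from the study described above in ways that should be kept in mind: the null parameters (4, 5, 0.42) are treated as known rather than estimated by maximum likelihood, only the KS statistic is used, the critical value is simulated inside the sketch instead of being read from Table 3, the mixing weight is assumed to have the form w = φ(ασ)/(ασΦ(ασ) + φ(ασ)), and LN(μ, σ) is read as a lognormal with log-mean μ and log-standard deviation σ. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA, THETA, SIGMA = 4.0, 5.0, 0.42       # null model PFLC(4, 5, 0.42)
K = ALPHA * SIGMA
PHI_K, CAP_PHI_K = STD_NORMAL.pdf(K), STD_NORMAL.cdf(K)
W = PHI_K / (K * CAP_PHI_K + PHI_K)        # assumed form of the mixing weight


def null_cdf(x: float) -> float:
    """CDF of the null model with known parameters (a simplification of this sketch)."""
    if x <= THETA:
        return W * (x / THETA) ** ALPHA
    z = math.log(x / THETA) / SIGMA - K
    return 1.0 - (1.0 - W) * (1.0 - STD_NORMAL.cdf(z)) / CAP_PHI_K


def null_draw(rng: random.Random) -> float:
    """One draw from the null model by inverting the CDF."""
    p = rng.random()
    while p == 0.0:
        p = rng.random()
    if p <= W:
        return THETA * (p / W) ** (1.0 / ALPHA)
    u = 1.0 - (1.0 - p) * CAP_PHI_K / (1.0 - W)
    return THETA * math.exp(ALPHA * SIGMA ** 2 + SIGMA * STD_NORMAL.inv_cdf(u))


def ks_statistic(sample) -> float:
    """KS distance between the sample EDF and the null CDF."""
    u = sorted(null_cdf(x) for x in sample)
    n = len(u)
    return max(max((i + 1) / n - ui, ui - i / n) for i, ui in enumerate(u))


def null_critical_value(n: int, reps: int, rng: random.Random, level: float = 0.05) -> float:
    """Upper-tail critical value of the KS statistic, simulated under the null."""
    stats = sorted(ks_statistic([null_draw(rng) for _ in range(n)]) for _ in range(reps))
    return stats[int((1.0 - level) * reps)]


def rejection_rate(sampler, n: int, crit: float, reps: int, rng: random.Random) -> float:
    """Fraction of simulated samples whose KS statistic exceeds crit."""
    hits = 0
    for _ in range(reps):
        if ks_statistic([sampler(rng) for _ in range(n)]) > crit:
            hits += 1
    return hits / reps
```

Passing `null_draw` as the sampler estimates the size of the test, while passing, for example, `lambda r: r.lognormvariate(2.3, 0.5)` estimates the power against a lognormal alternative under the parameterization assumed here.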

Real data analysis
We consider an example to illustrate the methods discussed in the preceding sections. The BREAST data represent the survival times of 121 breast cancer patients from one hospital. Table 8 shows the summary statistics for the BREAST data. The mean is 46.3289, and the median is 40.00. Notice that the mean is greater than the median, indicating that the distribution of the data is likely skewed to the right.
Fig 4 shows a histogram of these data, whose left side seems "chopped off" compared to the right side, which we describe as being skewed to the right.
In this study, the PFLC maximum likelihood estimates are $\hat\alpha = 1.05$, $\hat\theta = 34.10$, and $\hat\sigma = 0.5749$. Fig 5 presents the probability plot of the BREAST data, whose points approximate a straight line, which suggests that the PFLC distribution can provide a good fit to these data. Table 9 reports the test statistics and p-values (in parentheses) for the PPC and three GoF tests. It indicates that the PFLC distribution provides a reasonable fit to the BREAST data.

Conclusions
Goodness-of-fit testing is a key procedure for selecting the statistical distribution that best fits observed data. We performed tests of fit for the PFLC distribution, including the probability plot, probability plot correlation coefficient, KS test, CvM test, and AD test. Among these methods, the probability plot is a graphical method, while the others are statistical methods. We described the probability plot and considered the procedures, algorithms, and critical values for the other methods. We found that the AD test outperformed the KS and CvM tests based on power comparisons. Moreover, the PFLC distribution was successfully used, for the first time, to model the survival times of breast cancer patients. The tests developed in this paper indicated that the PFLC distribution fits the data well. The work in this paper can be extended in several ways.

In the probability plot of Fig 1, for the lower part AE, $q = \frac{1}{\alpha}[\ln p - \ln w]$. The straight line p = 0.5841 also divides AB into two parts: p < 0.5841 and p > 0.5841.

Fig 6 compares the fitted and empirical PDF of the BREAST data, and Fig 7 compares the fitted and empirical CDF of the BREAST data. It can be seen that the PFLC CDF exhibits a good match to the empirical CDF, but there is a slight discrepancy between the empirical and PFLC PDFs.